1. Purify the genome
Exploring coverage and GC%
- From the sequencing company, got the assembly consisting of 174 contigs with 76,648,537bp and 93.9% compeleteness (BUSCO on chlorophyta_odb10). However, the company also reported assemblies made with less data and of only slightly lower quality on 26 contigs. Could it be that the bigger assembly contains contaminants?
- Will do binning to identify contigs that might be external
- First, use minimap to align the reads onto the assembly
mkdir analysis_and_temp_files/02_genome_annotation
source package c92263ec-95e5-43eb-a527-8f1496d56f1a
source package 222eac79-310f-4d4b-8e1c-0cece4150333
minimap2 -t 20 -a data/your-data_fg25005_2025-03-20_1225/FG25005_01_GTX0536.fasta data/your-data_fg25005_2025-03-20_1225/FG25005_05_PAO94355_20250305_dorado_7.4.13_sup_pass.fastq.gz | samtools sort -@8 -o analysis_and_temp_files/04_genome_annotation/GTX0536.bam
samtools index analysis_and_temp_files/04_genome_annotation/GTX0536.bam
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta
- Use metaBAT, which produced 23 bins
source package 0a2dffce-c151-4379-abe9-866414c91cd7
cp data/your-data_fg25005_2025-03-20_1225/FG25005_01_GTX0536.fasta analysis_and_temp_files/04_genome_annotation/GTX0536.fasta
runMetaBat.sh -t 20 analysis_and_temp_files/04_genome_annotation/GTX0536.fasta analysis_and_temp_files/04_genome_annotation/GTX0536.bam
mv GTX0536.fasta.* analysis_and_temp_files/04_genome_annotation
- Identify prokaryotic MAGs with CheckM (plus calculate coverage depth for each contig and gc%). No plausible bacterial genomes here
source package 5a1c6a9a-f666-4eaa-9409-3e7435d86406
checkm coverage analysis_and_temp_files/04_genome_annotation/GTX0536.fasta.metabat* analysis_and_temp_files/04_genome_annotation/GTX0536.cov analysis_and_temp_files/04_genome_annotation/GTX0536.bam -x fa
sbatch --mem=100G -c 20 --wrap="source package 5a1c6a9a-f666-4eaa-9409-3e7435d86406; checkm lineage_wf analysis_and_temp_files/04_genome_annotation/GTX0536.fasta.metabat* analysis_and_temp_files/04_genome_annotation/GTX0536_checkm -x fa --tab_table > analysis_and_temp_files/04_genome_annotation/GTX0536.checkm"
source package /tsl/software/testing/bin/bbmap-37.90
stats.sh in=analysis_and_temp_files/04_genome_annotation/GTX0536.fasta gc=analysis_and_temp_files/04_genome_annotation/GTX0536.gc gcformat=4
- Visualize binning result. One contig (ptg000005c) had a super high coverage at 2246x, which is probably an organelle genome. The rest are split into two groups at ~36% GC and 50%
gc<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536.gc",header=T)
colnames(gc)[1]<-"contig"
gc$GC<-as.numeric(gc$GC)
cov<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536.fasta.depth.txt")
colnames(cov)[1]<-"contig"
bins<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536.cov")
colnames(bins)[1]<-"contig"
df<-left_join(gc,cov) %>% left_join(bins)
## Joining with `by = join_by(contig)`
## Joining with `by = join_by(contig)`
df$totalAvgDepth <- as.numeric(df$totalAvgDepth )
ggplot(df)+geom_point(aes(x=GC,y=totalAvgDepth,color=Bin.Id))

- Let’s remove ptg000005c:
- This highlighted another contig ptg000019l at 390x: could be another organelle genome, but is >2Mbp. Maybe a part of the nuclear genome?
- The group at 50% GC and 200x has longest contigs. Most likely, these correspond to the nuclear genome
library(patchwork)
g1<-ggplot(df %>% filter(contig!="ptg000005c"))+geom_point(aes(x=GC,y=totalAvgDepth,color=Bin.Id))
g2 <- ggplot(df %>% filter(contig!="ptg000005c"))+geom_point(aes(x=GC,y=totalAvgDepth,color=Length))
g1/g2

- Lets look closer at this region. It contained 22 contigs, most from bins 9 and 7. bins 2, 14, and 22 (just one contig each) are also very clearly in the same cloud. bins 13 and 18 (again just one contig each; coverage in the range 190-200) are probably also the part of the genome with the length of >2Mbp. Bin 18 (coverage 181x) is slightly below 1Mbp. The lowest three contigs in coverage are also the shortest (58 kbp to 190 kbp)
g1<-ggplot(df %>% filter(GC >0.45,totalAvgDepth > 150, totalAvgDepth <300))+ geom_point(aes(x=GC,y=totalAvgDepth,color=Bin.Id))
g2 <- ggplot(df %>% filter(GC >0.45,totalAvgDepth > 150, totalAvgDepth <300))+ geom_point(aes(x=GC,y=totalAvgDepth,color=Length))
g1+g2

Blasting against NCBI
- Look at ten matches per contig
source package /tsl/software/testing/bin/blast+-2.9.0
blastn \
-query analysis_and_temp_files/04_genome_annotation/GTX0536.fasta \
-db /tsl/data/ncbi_database/blast/nt_20220819/nt \
-outfmt '6 qseqid staxids bitscore std' \
-max_target_seqs 10 \
-max_hsps 1 \
-evalue 1e-25 \
-out analysis_and_temp_files/04_genome_annotation/GTX0536.blast.out -num_threads 20
- Go taxonomy for each hit, and identified the hits that are labelled as mitochondrial
library(taxize)
#get taxid for each hit based on its accession ID
blast_results<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536.blast.out",header=F)
get_taxid <- function(x){
list <- genbank2uid(id = x, key="86df5e42e9751ae03c5e114326c1740ef008")
list2 <- lapply(list,attributes)
df <- data.frame("accession"=x,
"taxid"=list[[1]][1],
"description"=list2[[1]]$name)
return(df)}
l<-lapply(blast_results$V5,get_taxid)
acc_taxid <- do.call(rbind,l)
acc_taxid <- acc_taxid %>% filter(!is.na(taxid),taxid!="") %>% distinct()
#get taxonomy for each taxid
get_taxonomy<- function(x){
df <- classification(x,db="ncbi",key="86df5e42e9751ae03c5e114326c1740ef008")[[1]]
df <- df %>% filter(rank %in% c("domain","kingdom","phylum","class")) %>% select(-id) %>% mutate(taxid=x)
return(df)}
l2<-lapply(unique(acc_taxid$taxid) ,get_taxonomy)
taxonomy <- do.call(rbind,l2) %>% pivot_wider(names_from = rank,values_from=name)
#combine all data together
blast <- blast_results %>% select(V1,V5)
colnames(blast) <- c("contig","accession")
blast2 <- acc_taxid %>% left_join(taxonomy) %>% left_join(blast) %>% distinct()
#save the table so that I don't need to re-run the ncbi step
write.table(blast2, "../analysis_and_temp_files/02_genome_annotation/GTX0536.blast.taxonomy.txt",col.names = T, row.names = F, quote = F,sep="\t")
- This confirms that the sequences in the cluster with bin 7 and bin 9 are the nuclear genome
- Plastid genome is ptg000005c
- ptg000019l also has plastid hits, which is weird given that it’s linear and >2Mbp, will keep an eye on that
- Mitochondrial genome is the cluster at 35 GC%, and it clearly failed to assemble, which is common. Will re-assemble it later on with a specialized software
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggforce)
blast<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536.blast.taxonomy.txt",header=T)
blast<- blast %>% mutate(label=case_when(
is.na(domain) ~ "Virus",
phylum=="Chlorophyta" & grepl("mitochon",description,) ~ "Chlorophyta mitochondrial",
phylum=="Chlorophyta" & grepl("chloroplast",description) ~ "Chlorophyta plastid",
phylum=="Chlorophyta" & ! grepl("chloroplast",description) & !grepl("mitochon",description) ~ "Chlorophyta nuclear",
phylum %in% c("Streptophyta","Rhodophyta") ~ "other Archaeplastida",
phylum %in% c("Annelida" ,"Arthropoda","Chordata","Nematoda","Bryozoa")~ "Animals"
))
blast_summary<-blast %>% filter(!is.na(label)) %>%
group_by(label,contig) %>% summarize(n=n()) %>%
ungroup() %>% group_by(contig) %>% top_n(n=1)
## `summarise()` has grouped output by 'label'. You can override using the
## `.groups` argument.
## Selecting by n
df<- blast_summary %>% left_join(df)
## Joining with `by = join_by(contig)`
ggplot(df)+geom_point(aes(x=GC,y=totalAvgDepth,color=label))+
facet_zoom(ylim = c(0, 400))

ggsave("../results/gc_cov.pdf",width=9,height=4)
Align to the genome from ASG
source package 222eac79-310f-4d4b-8e1c-0cece4150333
minimap2 -x asm20 -t 10 --secondary=no data/SL0000003.fasta analysis_and_temp_files/04_genome_annotation/GTX0536.fasta > analysis_and_temp_files/04_genome_annotation/GTX0536_SL0000003.paf
- Alignment overall looks reasonable (here showing only stretches >20kbp with quality >40)
library(pafr)
paf<-read_paf("../analysis_and_temp_files/02_genome_annotation/GTX0536_SL0000003.paf")
paf_filtered <-subset(paf, alen > 20000 & mapq > 40)
paf_filtered <-paf_filtered %>% distinct()
align<-paf_filtered %>% group_by(qname,tname) %>% summarize(total_alen = sum(alen)) %>%
pivot_wider(names_from = tname,values_from = total_alen,values_fill=0)
## `summarise()` has grouped output by 'qname'. You can override using the
## `.groups` argument.
dotplot(paf_filtered,order_by="qstart",label_seqs=F,dashes=F) + theme_bw()

- Checking which contigs mapped to which in the two assembly, see clear ‘partner’ in almost all cases.
- Exception 1: ptg000002l has about the same total length of matches with OZ234913.1 and OZ234917.1
- Exception 2: ptg000023l and ptg000065l are partnerless
align<-paf_filtered %>% group_by(qname,tname) %>% summarize(total_alen = sum(alen)) %>%
pivot_wider(names_from = tname,values_from = total_alen,values_fill=0)
## `summarise()` has grouped output by 'qname'. You can override using the
## `.groups` argument.
align %>% kable(format = "html", col.names = colnames(align)) %>%
kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "400px")
|
qname
|
OZ234907.1
|
OZ234912.1
|
OZ234914.1
|
OZ234913.1
|
OZ234916.1
|
OZ234917.1
|
OZ234918.1
|
OZ234910.1
|
OZ234919.1
|
OZ234923.1
|
OZ234921.1
|
OZ234924.1
|
OZ234920.1
|
OZ234922.1
|
OZ234909.1
|
OZ234908.1
|
OZ234925.1
|
OZ234906.1
|
OZ234915.1
|
OZ234911.1
|
|
ptg000001l
|
168424
|
51747
|
10054922
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000002l
|
0
|
0
|
0
|
12861796
|
122964
|
17901346
|
279330
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000003l
|
0
|
12682960
|
431478
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000004l
|
40818
|
0
|
450840
|
0
|
0
|
0
|
0
|
129294
|
13413450
|
69279
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000006l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
15188720
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000007l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1129727
|
0
|
0
|
0
|
5368634
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000008l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
4676922
|
155382
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000009l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
806364
|
0
|
0
|
0
|
0
|
0
|
0
|
20091160
|
0
|
0
|
0
|
0
|
0
|
|
ptg000010l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
18421717
|
0
|
0
|
0
|
0
|
|
ptg000011l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
9330528
|
0
|
0
|
0
|
|
ptg000012l
|
634403
|
0
|
318467
|
139290
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
26933188
|
0
|
0
|
|
ptg000013l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1383312
|
0
|
0
|
0
|
326885
|
0
|
0
|
0
|
0
|
0
|
0
|
17659341
|
0
|
|
ptg000014l
|
0
|
0
|
376640
|
0
|
12250144
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000015l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
11397846
|
0
|
0
|
0
|
665886
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000016l
|
25499502
|
540889
|
0
|
0
|
0
|
0
|
0
|
0
|
276272
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000017l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
139272
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
17326658
|
|
ptg000018l
|
0
|
0
|
130098
|
0
|
0
|
0
|
14381475
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000019l
|
0
|
0
|
472765
|
0
|
0
|
0
|
0
|
0
|
0
|
10603302
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000020l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
123438
|
0
|
8525180
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000023l
|
0
|
455544
|
164936
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
|
ptg000065l
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
117952
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
- Let’s picture syntheny for all partners
ptg000001l<-plot_synteny(paf_filtered, q_chrom="ptg000001l", t_chrom="OZ234914.1", centre=F)
ptg000002l<-plot_synteny(paf_filtered, q_chrom="ptg000002l", t_chrom="OZ234913.1", centre=F)
ptg000002l_2<-plot_synteny(paf_filtered, q_chrom="ptg000002l", t_chrom="OZ234917.1", centre=F)
ptg000003l<-plot_synteny(paf_filtered, q_chrom="ptg000003l", t_chrom="OZ234912.1", centre=F)
ptg000004l<-plot_synteny(paf_filtered, q_chrom="ptg000004l", t_chrom="OZ234919.1", centre=F)
ptg000006l<-plot_synteny(paf_filtered, q_chrom="ptg000006l", t_chrom="OZ234921.1", centre=F)
ptg000007l<-plot_synteny(paf_filtered, q_chrom="ptg000007l", t_chrom="OZ234924.1", centre=F)
ptg000008l<-plot_synteny(paf_filtered, q_chrom="ptg000008l", t_chrom="OZ234920.1", centre=F)
ptg000009l<-plot_synteny(paf_filtered, q_chrom="ptg000009l", t_chrom="OZ234909.1", centre=F)
ptg000010l<-plot_synteny(paf_filtered, q_chrom="ptg000010l", t_chrom="OZ234908.1", centre=F)
ptg000011l<-plot_synteny(paf_filtered, q_chrom="ptg000011l", t_chrom="OZ234925.1", centre=F)
ptg000012l<-plot_synteny(paf_filtered, q_chrom="ptg000012l", t_chrom="OZ234906.1", centre=F)
ptg000013l<-plot_synteny(paf_filtered, q_chrom="ptg000013l", t_chrom="OZ234915.1", centre=F)
ptg000014l<-plot_synteny(paf_filtered, q_chrom="ptg000014l", t_chrom="OZ234916.1", centre=F)
ptg000015l<-plot_synteny(paf_filtered, q_chrom="ptg000015l", t_chrom="OZ234910.1", centre=F)
ptg000016l<-plot_synteny(paf_filtered, q_chrom="ptg000016l", t_chrom="OZ234907.1", centre=F)
ptg000017l<-plot_synteny(paf_filtered, q_chrom="ptg000017l", t_chrom="OZ234911.1", centre=F)
ptg000018l<-plot_synteny(paf_filtered, q_chrom="ptg000018l", t_chrom="OZ234918.1", centre=F)
ptg000019l<-plot_synteny(paf_filtered, q_chrom="ptg000019l", t_chrom="OZ234923.1", centre=F)
ptg000020l<-plot_synteny(paf_filtered, q_chrom="ptg000020l", t_chrom="OZ234922.1", centre=F)
ptg000001l/(ptg000002l+ptg000002l_2)/ptg000003l/ptg000004l/ptg000006l/ptg000007l/ptg000008l/ptg000009l/ptg000010l/ptg000011l/ptg000012l/ptg000013l/ptg000014l/ptg000015l/ptg000016l/ptg000017l/ptg000018l/ptg000019l/ptg000020l

- The remaining two pairings do not seem legit
ptg0000023l<-plot_synteny(paf_filtered, q_chrom="ptg000023l", t_chrom="OZ234912.1", centre=TRUE)
ptg0000065l<-plot_synteny(paf, q_chrom="ptg000065l", t_chrom="OZ234921.1", centre=TRUE)
ptg0000023l/ptg0000065l

- Will nonetheless include contig 23
- Saved the 20 contigs as
analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.fasta
write.table(align$qname[1:19], "../analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.txt",col.names = F, row.names = F, quote = F,sep="\t")
source package 1041444f-cd25-4107-a5c7-5e86cb1728fe
seqkit grep -i -f analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.txt analysis_and_temp_files/02_genome_annotation/GTX0536.fasta > analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.fasta
Spotted a problem while visualizing GC% and telomeres
- Used script from Markus Hiltunen
- As a query used “CCCTAAA”, which is a telomeric repeat conserved between green algae and land plants (see Fulnekova et al. 2012)
- 12 contigs have telomeres on both ends
- 5 have on one end
- 2 had none
python code/detect_telomers.py analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.fasta -m "CCCTAAA" > analysis_and_temp_files/02_genome_annotation/GTX0536_telomere_detection.txt
- Used SEQtk and got GC% for non-overlapping windows of 1000 bp
source package 46a62eca-4f8f-45aa-8cc2-d4efc99dd9c6
seqkit sliding analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.fasta -s 1000 -W 1000 | seqkit fx2tab -n -g > analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.fasta_GC_sliding.txt
- Everything looks fine, except for the low GC% portion of ptg000019l. That’s the same part of the contig that had hits to plastid genomes, and the same that wasn’t aligned to a contig in the ASG genome. Could this be an misassembly?
library(stringr)
library(viridis)
## Loading required package: viridisLite
gc<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.fasta_GC_sliding.txt",header=F)[,c(1,2)]
colnames(gc)<-c("window","gc_content")
##get contig name and start of the window
gc$contig<-sub("_sliding.*", "", gc$window)
gc$window<-sub(".*:", "", gc$window)
gc$window_start<-sub("-.*", "", gc$window) %>% as.numeric()
gc$gc_content<-gc$gc_content %>% as.numeric()
##add data for telomere annotation
tel<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536_telomere_detection.txt",header=F)[,c(1,2)]
colnames(tel)<-c("contig","position")
tel_start_contig_list<-tel[tel$position=="forward",1] #list all contigs that have telomer at the contig start (corresponds to 'forward')
tel_end_contig_list<-tel[tel$position=="reverse",1] #list all contigs that have telomer at the contig start (corresponds to 'forward')
## get
tel_start<-gc %>% select(contig,window_start) %>% mutate(telomere="absent") %>% group_by(contig) %>% arrange(window_start) %>% filter(row_number()<25 & contig %in% tel_start_contig_list) %>% mutate(telomere="present") %>% ungroup()
tel_end<-gc %>% select(contig,window_start) %>% mutate(telomere="absent") %>% group_by(contig) %>% arrange(window_start) %>%
slice(tail(row_number(), 25)) %>% filter(contig %in% tel_end_contig_list) %>% mutate(telomere="present") %>% ungroup()
gc<-gc %>% left_join(rbind(tel_start,tel_end))
## Joining with `by = join_by(contig, window_start)`
gc$telomere[is.na(gc$telomere)]<-"absent"
##visualize
ggplot(gc)+
geom_tile(aes(y=fct_reorder(contig,window_start),x=window_start,color=gc_content))+
geom_tile(aes(y=fct_reorder(contig,window_start),x=window_start,alpha=telomere),fill="red")+
xlab("")+ylab("")+
scale_alpha_discrete(range=c(0,1))+
scale_color_viridis(begin = 1,end = 0)+
scale_x_continuous(breaks = c(0,1000000,2000000,3000000,4000000),
labels = c("0","1 Mbp","2 Mbp","3 Mbp","4 Mbp"))+
theme_minimal()
## Warning: Using alpha for a discrete variable is not advised.

Examine read alignment to detected and fix misassemblies
- Looked at the bam file around the ptg000019l:2410001-2420000 window, where the shift from ~50% GC to ~35% happens
- Found a clean break at 2,410,725 bp. Clearly, the end portion of the contig comes from the cloroplast, which also explains why the average coverage of this contig was higher than the rest of the nuclear genome. Will remove the chloroplast mathcing portion
knitr::include_graphics("../analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/ptg000019l.png")

- We can exclude that the same problem happened to other contigs, since they have very similar coverage to each other (192-210x), none had hits to chloroplast ,and none had more than one hit to mitochondrion
- While we are at it, let’s look at ptg000002l:2972181-3246738, which is the border of the parts of the contig aligned to OZ234917.1 and OZ234913.1. Could this also be a misassembly?
- To simplify searching, let’s calculate the depth in non-overlapping 1000bp windows for a larger region 2,750,000-3,500,000
for i in {2750..3500}; do s=$(($i*1000)); e=$(($i*1000+1000)); samtools coverage -r ptg000002l:$s-$e analysis_and_temp_files/02_genome_annotation/GTX0536.bam; done > analysis_and_temp_files/02_genome_annotation/ptg000002l.cov
grep "ptg000002l" analysis_and_temp_files/02_genome_annotation/ptg000002l.cov > analysis_and_temp_files/02_genome_annotation/ptg000002l.filtered.cov
- Coverage is uneven here, but there are no clear changes in GC%. The window with the lowest coverage still shows to ha
depth<-read.delim2("../analysis_and_temp_files/02_genome_annotation/ptg000002l.filtered.cov",header=F)[,c(2,7)]
colnames(depth)<-c("window","depth")
depth$depth <- as.numeric(depth$depth)
gc$window<-as.numeric(gc$window_start-1)
x<-gc %>% filter(contig=="ptg000002l",window>=2750000,window<=3500000) %>% left_join(depth) %>%
select(window,gc_content,depth) %>%
pivot_longer(-window,names_to = "statistic",values_to = "n") %>%
filter(!is.na(n))
## Joining with `by = join_by(window)`
ggplot(x,aes(x=window,y=n))+ geom_col()+
facet_wrap(~statistic,scales = "free_y",ncol=1)

- Looked in IGV at the portion with the most drastic coverage jump (417x in the window 3,024,000 to 100x in 3,025,000), but it still has reasonable coverage and no obvious breaks, as well as the flanking regions
knitr::include_graphics("../analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/ptg000002.png")

- Can we find telomeres in this region?
- Nope! got a few hits, but those are not clustered and are likely by chance
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.fasta ptg000002l:2750000-3500000 > analysis_and_temp_files/02_genome_annotation/ptg000002l_suspect.fa
grep "CCCTAAACCCTAAA" -n analysis_and_temp_files/02_genome_annotation/ptg000002l_suspect.fa
grep "GGGTTTAGGGTTT" -n analysis_and_temp_files/02_genome_annotation/ptg000002l_suspect.fa
- Can we find telomeric repeats in the middle of other contigs? Searched the entire assembly
seqkit locate -i -p "CCCTAAACCCTAAACCCTAAACCCTAAACCCTAAA" analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear.fasta > analysis_and_temp_files/02_genome_annotation/GTX0536_telomeric_repeats_across.txt
- Got one short match in the middle of ptg000002l
- Back-to-back reverse and forward repeats in the middle of ptg000003l
- one area in ptg000004l, ptg000008l, and ptg000009l; all reverse
- Several areas, both forward and reverse in ptg000019l
library(ivs)
repeats<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536_telomeric_repeats_across.txt",header=T)
colnames(repeats)[1]<-"contig"
repeats<- repeats %>% left_join(df %>% select(contig,Length))
## Joining with `by = join_by(contig)`
#remove hits that are closer than 1,500 bp tp a contig end
repeats2 <- repeats %>% mutate(Length_2000 = Length-2000) %>%
filter((strand=="+" & start>2000) | (strand=="-" & end<Length_2000))
repeats3 <- repeats2 %>%
group_by(contig,strand) %>%
mutate(group = iv_identify_group(iv(start, end))) %>%
group_by(group, .add = TRUE) %>%
summarise(start = min(iv_start(group)),
end = max(iv_end(group))) %>%
select(contig,strand,start,end)
## `summarise()` has grouped output by 'contig', 'strand'. You can override using
## the `.groups` argument.
repeats3%>% kable(format = "html", col.names = colnames(repeats3)) %>%
kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "300px")
|
contig
|
strand
|
start
|
end
|
|
ptg000002l
|
|
2984188
|
2984222
|
|
ptg000003l
|
|
3706286
|
3706481
|
|
ptg000003l
|
|
3706488
|
3706613
|
|
ptg000003l
|
|
3705509
|
3705634
|
|
ptg000003l
|
|
3705641
|
3706235
|
|
ptg000004l
|
|
3026801
|
3026877
|
|
ptg000004l
|
|
3026911
|
3027428
|
|
ptg000008l
|
|
1708678
|
1708726
|
|
ptg000009l
|
|
3382353
|
3382401
|
|
ptg000009l
|
|
3976044
|
3976085
|
|
ptg000019l
|
|
8679
|
8734
|
|
ptg000019l
|
|
12336
|
12391
|
|
ptg000019l
|
|
15661
|
15695
|
|
ptg000019l
|
|
2409990
|
2410066
|
|
ptg000019l
|
|
2410104
|
2410145
|
|
ptg000019l
|
|
2410152
|
2410718
|
Identified several irregularity, where at the end of a contig there is a break. To remedy this, will remove the ‘tails’. From ptg000003l remove bases after 3706235
knitr::include_graphics("../analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/ptg000003l_3704286-3708286.png")

- From ptg000004l remove bases after 3027428
knitr::include_graphics("../analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/ptg000004l_3026090-3030090.png")

- All other matches to telomeric repeats were not associated with assembly breaks
analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/
- To be safe, examined other end of contigs that lack telomeres to exclude similar problems. Found that beginning of contig 8 suffers from the same problem. Will remove first 1035 bp
knitr::include_graphics("../analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/ptg000008l_1-4000.png")

- Will remove bases past 4,931,045 from ptg000016l
knitr::include_graphics("../analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/ptg000016l_4929045-4931045.png")

- Will remove bases past 2,339,172 ptg000020l and before 1071
knitr::include_graphics("../analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/ptg000020l_1-4000.png")

knitr::include_graphics("../analysis_and_temp_files/02_genome_annotation/checking_for_misassemblies/ptg000020l_2335936-2339936.png")

Scafolding of the ‘tail’ of ptg000003l? Didn’t work
- Unlike other ‘tails’ which short, the tail of ptg000003l is (3706236 to 3719598) is 13362 bp, and it also seem to contain a forward telomeric repeat. Could it be a fragment of a different contig?
- Looking at the existing paf file, this region only got hit to middle of two different contigs, each <1500bp
subset(paf, qname=="ptg000003l" & qend>3706236 & mapq > 40) %>%
select(qname,qstart,qend,tname,tstart,tend) %>% distinct()
## pafr object with 2 alignments (0Mb)
## 1 query seqs
## 2 target seqs
## -6 tags:
- Aligning just the tail against the SL0000003 genome, didn’t get any hits with better quality
- Aligning it against the GTX0536 assembly from which the tail is removed, it maps back on to the same contig but in reverse strand
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta ptg000003l:3706236-3719598 > analysis_and_temp_files/02_genome_annotation/ptg000003l_tail.fa
blastn -query analysis_and_temp_files/02_genome_annotation/ptg000003l_tail.fa -subject data/SL0000003.fasta -outfmt 6
minimap2 -x asm20 -t 10 --secondary=no data/SL0000003.fasta analysis_and_temp_files/02_genome_annotation/ptg000003l_tail.fa > analysis_and_temp_files/02_genome_annotation/ptg000003l_tail_SL0000003.paf
ptg000003l:3706236-3719598 13363 4020 5359 - OZ234919.1 2842224 730997 732352 615 1378 60 tp:A:P cm:i:69 s1:i:597 s2:i:280 dv:f:0.0078 rl:i:414
ptg000003l:3706236-3719598 13363 1660 2765 - OZ234917.1 2908467 564827 565829 283 1127 60 tp:A:P cm:i:18 s1:i:252 s2:i:0 dv:f:0.0081 rl:i:414
ptg000003l:3706236-3719598 13363 3349 3833 - OZ234916.1 3059195 732081 732606 106 525 17 tp:A:P cm:i:8 s1:i:98 s2:i:85 dv:f:0.0166 rl:i:414
ptg000003l:3706236-3719598 13363 672 1119 + OZ234913.1 3320455 426382 426832 57 453 0 tp:A:P cm:i:3 s1:i:54 s2:i:46 dv:f:0.0503 rl:i:414
ptg000003l:3706236-3719598 13363 570 858 + OZ234922.1 2422596 249762 250050 46 288 5 tp:A:P cm:i:3 s1:i:46 s2:i:0 dv:f:0.0562 rl:i:414
minimap2 -x asm20 -t 10 analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta analysis_and_temp_files/02_genome_annotation/ptg000003l_tail.fa
ptg000003l:3706236-3719598 13363 233 13353 - ptg000003l:1-3706235 3706235 3692266 3705653 7673 13430 60 tp:A:P cm:i:1169 s1:i:7578 s2:i:290 dv:f:0.0002 rl:i:945
Finalized the genome
- Made a new fasta file where portion of contigs outlined above are truncated
seqkit grep -i -f analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.txt analysis_and_temp_files/02_genome_annotation/GTX0536.fasta > analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta ptg000019l:1-2410725 >> analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta ptg000003l:1-3706235 >> analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta ptg000004l:1-3027428 >> analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta ptg000008l:1035-2212220 >> analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta ptg000016l:1-4931045 >> analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta ptg000020l:1071-2339172 >> analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta
- At a later stage, added another contig (ptg000023l), after it became clear that it contained GEX1 gene that wasn’t found elsewhere in the genome, but was present in all other Trebouxia genomes (minus SL0003 genome)
cp analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final2.fasta
source package aeee87c4-1923-4732-aca2-f2aff23580cc
samtools faidx analysis_and_temp_files/02_genome_annotation/GTX0536.fasta ptg000023l >> analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final2.fasta
- Cleaned and sorted the assembly using funannotate
singularity run ../singularity/funannotate.sif funannotate sort -i analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final2.fasta -o analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final_sort2.fasta -b GTX0536
Check completeness scores
- BUSCO score: C:96.5%[S:95.9%,D:0.6%],F:0.7%,M:2.8%,n:1519
(BUSCO v4.0.6, chlorophyta_odb10)
mkdir analysis_and_temp_files/02_genome_annotation/GTX0536_busco2
source package ca890cd7-f81d-4c22-9f4a-5b40ab671c79
source package 85f2de80-4bd0-48dc-9303-bba1a19206e4
export AUGUSTUS_CONFIG_PATH=../02_long_read_assemblies/analysis_and_temp_files/02_binning/tmp_augustus/config
busco -i analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final2.fasta -o GTX0536 --out_path analysis_and_temp_files/02_genome_annotation/GTX0536_busco2 -m genome -l /tsl/data/busco_lineages/chlorophyta_odb10 --offline -f -c 20
- For comparison, analyzed the ASG genome, and got slightly worse results
- C:92.3%[S:91.6%,D:0.7%],F:0.5%,M:7.2%,n:1519
busco -i data/SL0000003.fasta -o SL0000003 --out_path analysis_and_temp_files/02_genome_annotation/SL0000003_busco -m genome -l /tsl/data/busco_lineages/chlorophyta_odb10 --offline -f -c 20
- Visualize completeness scores
library(geomtextpath)
busco1<-read.delim("../analysis_and_temp_files/02_genome_annotation/GTX0536_busco/GTX0536/short_summary.specific.chlorophyta_odb10.GTX0536.txt",header=F,skip = 9)
busco1$type<-c("Complete (95.9%)","Duplicated (0.6%)","Fragmented (0.7%)","Missing (2.8%)","Total")
busco1<-busco1[1:4,]
hsize <- 1.5
busco1$x<-hsize
b1<-ggplot(busco1,aes(y=V2,fill=type,x=hsize))+geom_col()+
coord_curvedpolar(theta = "y")+ xlim(c(0.2, hsize + 0.5))+
scale_fill_manual(values=c("Complete (95.9%)"="#60ba30","Duplicated (0.6%)"="#1f5900","Fragmented (0.7%)"="#ccf2b8","Missing (2.8%)"="white"))+
theme_void()+theme(legend.title = element_blank(),legend.text = element_text(size=9),
legend.key=element_rect(colour="#969696"))+
ggtitle("GTX0536")
busco2<-read.delim("../analysis_and_temp_files/02_genome_annotation/SL0000003_busco/SL0000003/short_summary.specific.chlorophyta_odb10.SL0000003.txt",header=F,skip = 9)
busco2$type<-c("Complete (91.6%)","Duplicated (0.7%)","Fragmented (0.5%)","Missing (7.2%)","Total")
busco2<-busco2[1:4,]
hsize <- 1.5
busco2$x<-hsize
b2<-ggplot(busco2,aes(y=V2,fill=type,x=hsize))+geom_col()+
coord_curvedpolar(theta = "y")+ xlim(c(0.2, hsize + 0.5))+
scale_fill_manual(values=c("Complete (91.6%)"="#60ba30","Duplicated (0.7%)"="#1f5900","Fragmented (0.5%)"="#ccf2b8","Missing (7.2%)"="white"))+
theme_void()+theme(legend.title = element_blank(),legend.text = element_text(size=9),
legend.key=element_rect(colour="#969696"))+
ggtitle("SL0000003")
b1+b2

ggsave("../results/busco.pdf",plot=b1,width=3,height=3)
Visualizing GC% and telomeres again
python code/detect_telomers.py analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final_sort2.fasta -m "CCCTAAA" > analysis_and_temp_files/02_genome_annotation/GTX0536_final_telomere_detection2.txt
seqkit sliding analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final_sort2.fasta -s 1000 -W 1000 | seqkit fx2tab -n -g > analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta_GC_sliding2.txt
- Now, got both telomeres on 16 contigs and singe telomere on the remaining 3
gc<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final.fasta_GC_sliding2.txt",header=F)[,c(1,4)]
colnames(gc)<-c("window","gc_content")
##get contig name and start of the window
gc$contig<-sub("_sliding.*", "", gc$window)
gc$window<-sub(".*:", "", gc$window)
gc$window_start<-sub("-.*", "", gc$window) %>% as.numeric()
gc$gc_content<-gc$gc_content %>% as.numeric()
##add data for telomere annotation
tel<-read.delim2("../analysis_and_temp_files/02_genome_annotation/GTX0536_final_telomere_detection2.txt",header=F)[,c(1,2)]
colnames(tel)<-c("contig","position")
tel_start_contig_list<-tel[tel$position=="forward",1] #list all contigs that have telomer at the contig start (corresponds to 'forward')
tel_end_contig_list<-tel[tel$position=="reverse",1] #list all contigs that have telomer at the contig start (corresponds to 'forward')
## get
tel_start<-gc %>% select(contig,window_start) %>% mutate(telomere="absent") %>% group_by(contig) %>% arrange(window_start) %>% filter(row_number()<50 & contig %in% tel_start_contig_list) %>% mutate(telomere="present") %>% ungroup()
tel_end<-gc %>% select(contig,window_start) %>% mutate(telomere="absent") %>% group_by(contig) %>% arrange(window_start) %>%
dplyr::slice(tail(row_number(), 50)) %>% filter(contig %in% tel_end_contig_list) %>% mutate(telomere="present") %>% ungroup()
gc<-gc %>% left_join(rbind(tel_start,tel_end))
## Joining with `by = join_by(contig, window_start)`
gc$telomere[is.na(gc$telomere)]<-"absent"
##visualize
ggplot(gc)+
geom_tile(aes(y=fct_reorder(contig,window_start),x=window_start,color=gc_content,height = 0.8))+
geom_tile(aes(y=fct_reorder(contig,window_start),x=window_start,alpha=telomere,height=0.9),fill="red")+
xlab("")+ylab("")+
scale_alpha_discrete(range=c(0,1))+
scale_color_viridis(begin = 1,end = 0)+
scale_x_continuous(breaks = c(0,2000000,4000000,6000000),
labels = c("0","2 Mbp","4 Mbp","6 Mbp"))+
theme_minimal()
## Warning: Using alpha for a discrete variable is not advised.

ggsave("../results/genome.pdf",width=5,height=4)
Visualize syntheny between the finalized genome and ASG
- Re-ran minimap alignment on the finalized genome
minimap2 -x asm20 -t 10 analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final_sort2.fasta data/SL0000003.fasta > analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final2_SL0000003.paf
- Get info on contig lengths for both files
seqkit fx2tab --length --name --header-line data/SL0000003.fasta > analysis_and_temp_files/02_genome_annotation/SL0000003.lengths.txt
seqkit fx2tab --length --name --header-line analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final_sort2.fasta > analysis_and_temp_files/02_genome_annotation/GTX0536.2.lengths.txt
library(syntenyPlotteR.BETA)
#prep inputs
paf2<-read_paf("../analysis_and_temp_files/02_genome_annotation/GTX0536_nuclear_final2_SL0000003.paf")
paf2_filtered <-subset(paf2, alen > 20000 & mapq > 40)
paf2_filtered <-paf2_filtered %>% distinct()
paf2_filtered <- paf2_filtered[,c(6,8,9,1,3,4,5)]
paf2_filtered$tname <- str_replace(paf2_filtered$tname,"GTX0536_","")
paf2_filtered$qname <- str_replace(paf2_filtered$qname,"OZ2349","")
paf2_filtered$qname <- str_replace(paf2_filtered$qname,"\\.1","")
paf2_filtered$q <- "GTX0536"
paf2_filtered$t<- "SL0000003"
write.table(paf2_filtered,"../analysis_and_temp_files/02_genome_annotation/synteny_paf.txt",quote = F, col.names = F, row.names = F,sep="\t")
size1 <- read.delim("../analysis_and_temp_files/02_genome_annotation/GTX0536.2.lengths.txt")[,c(1,4)] %>% mutate(id="GTX0536")
size2 <- read.delim("../analysis_and_temp_files/02_genome_annotation/SL0000003.lengths.txt") %>% mutate(id="SL0000003")
size2$X.name <- gsub( " .*$", "", size2$X.name)
size<-rbind(size2,size1)
#order contigs
target <- c("OZ234917.1","OZ234913.1","OZ234906.1","OZ234907.1","OZ234909.1",
"OZ234908.1","OZ234911.1","OZ234915.1","OZ234912.1","OZ234910.1",
"OZ234916.1","OZ234914.1","OZ234918.1","OZ234919.1","OZ234921.1",
"OZ234924.1","OZ234923.1","OZ234922.1","OZ234920.1","OZ234925.1",
size1$X.name)
size<-size[match(target, size$X.name),]
size$X.name <- str_replace(size$X.name,"GTX0536_","")
size$X.name <- str_replace(size$X.name,"OZ2349","")
size$X.name <- str_replace(size$X.name,"\\.1","")
write.table(size,"../analysis_and_temp_files/02_genome_annotation/synteny_size.txt",quote = F, col.names = F, row.names = F,sep="\t")
draw.linear.2.0("synteny",
"../analysis_and_temp_files/02_genome_annotation/synteny_size.txt",
"../analysis_and_temp_files/02_genome_annotation/synteny_paf.txt",
h=1.5, w=6 , directory="../results",
angle.chr.label = 0,chr.label.height = 0.2,insert.size = 600000,
chr.label.size = 3,sps.label.size = 4)
## Saving linear image to ../results

draw.linear.2.0("synteny",
"../analysis_and_temp_files/02_genome_annotation/synteny_size.txt",
"../analysis_and_temp_files/02_genome_annotation/synteny_paf.txt",
h=1.5, w=6 , directory="../results",fileformat = "pdf",
angle.chr.label = 0,chr.label.height = 0.2,insert.size = 600000,
chr.label.size = 3,sps.label.size = 4)
## Saving linear image to ../results

Conclusions so far:
- Will treat the contigs ptg000001l-ptg000004l and ptg000006l-ptg000020l as the core genome. Several of the contigs required trimming of ends
- In total, have 20 contigs that have high synteny with a recently published Trebouxia genome from ASG (all but one)
- Recovered plastid genome as a single contig with 320,516 bp
- Mitochondria genome will require re-assembly
5. Nuclear genome of SL0000003
- For comparison, let’s annotates the SL0000003 genome too
5.1 Visualize
- First visualize telomeres and GC% in the same way as above
python code/detect_telomers.py data/SL0000003.fasta -m "CCCTAAA" > analysis_and_temp_files/02_genome_annotation/SL0000003_telomere_detection.txt
seqkit sliding data/SL0000003.fasta -s 1000 -W 1000 | seqkit fx2tab -n -g > analysis_and_temp_files/02_genome_annotation/SL0000003.fasta_GC_sliding.txt
- Here, out of 20 contigs: 3 got both telomeres, 10 got a singe telomere, and 7 got no telomeres
gc<-read.delim2("../analysis_and_temp_files/02_genome_annotation/SL0000003.fasta_GC_sliding.txt",header=F)[,c(1,2)]
colnames(gc)<-c("window","gc_content")
##get contig name and start of the window
gc$contig<-sub("_sliding.*", "", gc$window)
gc$window<-sub(".*:", "", gc$window)
gc$window_start<-sub("-.*", "", gc$window) %>% as.numeric()
gc$gc_content<-gc$gc_content %>% as.numeric()
##add data for telomere annotation
tel<-read.delim2("../analysis_and_temp_files/02_genome_annotation/SL0000003_telomere_detection.txt",header=F)[,c(1,2)]
colnames(tel)<-c("contig","position")
tel$contig <- gsub( " .*$", "", tel$contig)
tel_start_contig_list<-tel[tel$position=="forward",1] #list all contigs that have telomer at the contig start (corresponds to 'forward')
tel_end_contig_list<-tel[tel$position=="reverse",1] #list all contigs that have telomer at the contig start (corresponds to 'forward')
## get
tel_start<-gc %>% select(contig,window_start) %>% mutate(telomere="absent") %>% group_by(contig) %>% arrange(window_start) %>% filter(row_number()<50 & contig %in% tel_start_contig_list) %>% mutate(telomere="present") %>% ungroup()
tel_end<-gc %>% select(contig,window_start) %>% mutate(telomere="absent") %>% group_by(contig) %>% arrange(window_start) %>%
dplyr::slice(tail(row_number(), 50)) %>% filter(contig %in% tel_end_contig_list) %>% mutate(telomere="present") %>% ungroup()
gc<-gc %>% left_join(rbind(tel_start,tel_end))
## Joining with `by = join_by(contig, window_start)`
gc$telomere[is.na(gc$telomere)]<-"absent"
##visualize
ggplot(gc)+
geom_tile(aes(y=fct_reorder(contig,window_start),x=window_start,color=gc_content,height = 0.8))+
geom_tile(aes(y=fct_reorder(contig,window_start),x=window_start,alpha=telomere,height=0.8),fill="red")+
xlab("")+ylab("")+
scale_alpha_discrete(range=c(0,1))+
scale_color_viridis(begin = 1,end = 0)+
scale_x_continuous(breaks = c(0,2000000,4000000,6000000),
labels = c("0","2 Mbp","4 Mbp","6 Mbp"))+
theme_minimal()
## Warning: Using alpha for a discrete variable is not advised.

5.2 Repeat masking
- First, make a repeat database
source package 85eb6fb0-3eb7-43b2-9659-49e0142481fc
BuildDatabase -name analysis_and_temp_files/02_genome_annotation/SL0000003_repeatdb data/SL0000003.fasta
sbatch --mem=40G -c 20 --partition="tsl-long" --wrap="source package 85eb6fb0-3eb7-43b2-9659-49e0142481fc; RepeatModeler -database analysis_and_temp_files/02_genome_annotation/SL0000003_repeatdb -pa 20 -LTRStruct >& analysis_and_temp_files/02_genome_annotation/SL0000003_repeatmodeler.out"
source package /tsl/software/testing/bin/repeatmasker-4.0.9
RepeatMasker -pa 5 -a -s -gff -xsmall -lib analysis_and_temp_files/02_genome_annotation/SL0000003_repeatdb-families.fa data/SL0000003.fasta &> analysis_and_temp_files/02_genome_annotation/repeatmasker_SL0000003.run.out
==================================================
file name: SL0000003.fasta
sequences: 20
total length: 63705246 bp (63687646 bp excl N/X-runs)
GC level: 50.26 %
bases masked: 6735955 bp ( 10.57 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
SINEs: 0 0 bp 0.00 %
ALUs 0 0 bp 0.00 %
MIRs 0 0 bp 0.00 %
LINEs: 4003 1395608 bp 2.19 %
LINE1 644 482141 bp 0.76 %
LINE2 27 5119 bp 0.01 %
L3/CR1 0 0 bp 0.00 %
LTR elements: 1229 467903 bp 0.73 %
ERVL 0 0 bp 0.00 %
ERVL-MaLRs 0 0 bp 0.00 %
ERV_classI 0 0 bp 0.00 %
ERV_classII 0 0 bp 0.00 %
DNA elements: 223 117885 bp 0.19 %
hAT-Charlie 0 0 bp 0.00 %
TcMar-Tigger 0 0 bp 0.00 %
Unclassified: 14514 3575967 bp 5.61 %
Total interspersed repeats: 5557363 bp 8.72 %
Small RNA: 370 170075 bp 0.27 %
Satellites: 0 0 bp 0.00 %
Simple repeats: 17880 894185 bp 1.40 %
Low complexity: 2167 107672 bp 0.17 %
==================================================
5.3. Gene prediction
- Rename contigs to avoid problems later
singularity run ../singularity/funannotate.sif funannotate sort -i data/SL0000003.fasta.masked -b scaffold -o data/SL0000003.fasta.masked.renamed
source genemark_ES_ET_EP-4.62_CBG
mkdir -p analysis_and_temp_files/02_genome_annotation/SL0000003_genemark
cd analysis_and_temp_files/02_genome_annotation/SL0000003_genemark
gmes_petap.pl --ES --max_intron 3000 --soft_mask 2000 --cores 20 --sequence ../../../data/SL0000003.fasta.masked.renamed
cd ../../../
- Funannotate predict. Since for this genome,we don’t have RNA data, I supplied proteins from GTX0536 as additional protein evidence on top of uniprot
sbatch --mem=50G -c 20 --partition="tsl-long" --wrap="singularity run ../singularity/funannotate.sif funannotate predict -i data/SL0000003.fasta.masked.renamed -o analysis_and_temp_files/02_genome_annotation/SL0000003_pred --species 'Trebouxia SL0000003' --cpus 20 --strain SL0000003 --optimize_augustus --genemark_gtf analysis_and_temp_files/02_genome_annotation/SL0000003_genemark/genemark.gtf --organism other --busco_db chlorophyta_odb10 -d /tsl/scratch/gol22pin/singularity/funannotate2/opt/databases --busco_seed_species chlamydomonas --weights genemark:1 --protein_evidence ../singularity/funannotate2/opt/databases/uniprot_sprot.fasta analysis_and_temp_files/02_genome_annotation/GTX0536_pred/predict_results/Trebouxia_sp._A48.proteins.fa"
[Apr 07 12:08 PM]: 10,278 total gene models from EVM
[Apr 07 12:08 PM]: Generating protein fasta files from 10,278 EVM models
[Apr 07 12:08 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Apr 07 12:08 PM]: Found 350 gene models to remove: 3 too short; 0 span gaps; 347 transposable elements
[Apr 07 12:08 PM]: 9,928 gene models remaining
[Apr 07 12:08 PM]: Predicting tRNAs
[Apr 07 12:09 PM]: 133 tRNAscan models are valid (non-overlapping)
[Apr 07 12:09 PM]: Generating GenBank tbl annotation file
[Apr 07 12:10 PM]: Collecting final annotation files for 10,061 total gene models
5.4. Functional annotations
source package 0dd71e29-8eb1-4512-b37c-42f7158718f4
source package /tsl/software/testing/bin/gcc-5.2.0
source package 999eb878-6c39-444e-a291-e2e0a86660e6
source package /tsl/software/testing/bin/java-11.0.7
source package 0f2514dd-8288-47ed-96cd-80905f9b0644
source package /tsl/software/production/bin/perl-5.16.2
mkdir -p analysis_and_temp_files/02_genome_annotation/SL0000003_pred/interpro
rm -rf temp
#tried newer version InterProScan-5.52-86.0
#CDD-3.18,Coils-2.2.1,Gene3D-4.3.0,Hamap-2020_05,MobiDBLite-2.0,PANTHER-15.0,Pfam-33.1,PIRSF-3.10,PIRSR-2021_02,PRINTS-42.0,ProSitePatterns-2021_01,ProSiteProfiles-2021_01,SFLD-4,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0
source package 0dd71e29-8eb1-4512-b37c-42f7158718f4
interproscan.sh -i analysis_and_temp_files/02_genome_annotation/SL0000003_pred/predict_results/Trebouxia_SL0000003_SL0000003.proteins.fa -d analysis_and_temp_files/02_genome_annotation/SL0000003_pred/interpro -dp -f XML -goterms -dra -cpu 20
singularity run ../singularity/funannotate.sif funannotate annotate -i analysis_and_temp_files/02_genome_annotation/SL0000003_pred/ --iprscan analysis_and_temp_files/02_genome_annotation/SL0000003_pred/interpro/Trebouxia_SL0000003_SL0000003.proteins.fa.xml --cpus 20 --sbt analysis_and_temp_files/02_genome_annotation/template.sbt --busco_db /tsl/data/busco_lineages/chlorophyta_odb10 --rename SL003
[Apr 08 09:36 AM]: Annotation consists of: 10,061 gene models
[Apr 08 09:36 AM]: 9,928 protein records loaded
[Apr 08 09:36 AM]: Running HMMer search of PFAM version 35.0
[Apr 08 09:38 AM]: 10,572 annotations added
[Apr 08 09:38 AM]: Running Diamond blastp search of UniProt DB version 2023_01
[Apr 08 09:39 AM]: 448 valid gene/product annotations from 735 total
[Apr 08 09:39 AM]: Install eggnog-mapper or use webserver to improve functional annotation: https://github.com/jhcepas/eggnog-mapper
[Apr 08 09:39 AM]: No Eggnog-mapper results found.
[Apr 08 09:39 AM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.88
[Apr 08 09:39 AM]: 448 gene name and product description annotations added
[Apr 08 09:39 AM]: Running Diamond blastp search of MEROPS version 12.0
[Apr 08 09:39 AM]: 332 annotations added
[Apr 08 09:39 AM]: Annotating CAZYmes using HMMer search of dbCAN version 11.0
[Apr 08 09:39 AM]: 187 annotations added
[Apr 08 09:39 AM]: Annotating proteins with BUSCO /tsl/data/busco_lineages/chlorophyta_odb10 models
[Apr 08 09:41 AM]: 1,281 annotations added
[Apr 08 09:41 AM]: Skipping phobius predictions, try funannotate remote -m phobius
[Apr 08 09:41 AM]: Skipping secretome: neither SignalP nor Phobius searches were run
[Apr 08 09:41 AM]: 0 secretome and 0 transmembane annotations added
[Apr 08 09:42 AM]: Parsing InterProScan5 XML file
[Apr 08 09:44 AM]: Found 0 duplicated annotations, adding 51,915 valid annotations